Abstract

In deterministic optimization, line searches are a standard tool ensuring stability 
and efficiency. Where only stochastic gradients are available, no direct equivalent 
has so far been formulated, because uncertain gradients do not allow for a strict 
sequence of decisions collapsing the search space. We construct a probabilistic line 
search by combining the structure of existing deterministic methods with notions 
from Bayesian optimization. Our method retains a Gaussian process surrogate of 
the univariate optimization objective, and uses a probabilistic belief over the Wolfe 
conditions to monitor the descent. The algorithm has very low computational cost, 
and no user-controlled parameters. Experiments show that it effectively removes 
the need to define a learning rate for stochastic gradient descent. 


1 Introduction 

Stochastic gradient descent (SGD) [1] is currently the standard in machine learning for the optimization
of highly multivariate functions if their gradient is corrupted by noise. This includes the online or
batch training of neural networks, logistic regression [2, 3] and variational models [e.g. 4, 5, 6]. In all
these cases, noisy gradients arise because an exchangeable loss function $\mathcal{L}(x)$ of the optimization
parameters $x \in \mathbb{R}^D$, across a large dataset $\{d_i\}_{i=1,\dots,M}$, is evaluated only on a subset $\{d_j\}_{j=1,\dots,m}$:

$$\mathcal{L}(x) := \frac{1}{M}\sum_{i=1}^{M} \ell(x, d_i) \approx \frac{1}{m}\sum_{j=1}^{m} \ell(x, d_j) =: \hat{\mathcal{L}}(x), \qquad m \ll M. \tag{1}$$

If the indices j are i.i.d. draws from [1, M], by the Central Limit Theorem, the error $\hat{\mathcal{L}}(x) - \mathcal{L}(x)$
is unbiased and approximately normally distributed. Despite its popularity and its low cost per step,
SGD has well-known deficiencies that can make it inefficient, or at least tedious to use in practice. 
Two main issues are that, first, the gradient itself, even without noise, is not the optimal search
direction; and second, SGD requires a step size (learning rate) that has drastic effect on the algorithm’s
efficiency, is often difficult to choose well, and virtually never optimal for each individual descent
step. The former issue, adapting the search direction, has been addressed by many authors [see 7 for
an overview]. Existing approaches range from lightweight ‘diagonal preconditioning’ approaches
like ADAGRAD [8] and ‘stochastic meta-descent’ [9], to empirical estimates for the natural gradient
[10] or the Newton direction [11], to problem-specific algorithms [12], and more elaborate estimates
of the Newton direction [13]. Most of these algorithms also include an auxiliary adaptive effect on
the learning rate. And Schaul et al. [14] recently provided an estimation method to explicitly adapt
the learning rate from one gradient descent step to another. None of these algorithms change the 
size of the current descent step. Accumulating statistics across steps in this fashion requires some 
conservatism: If the step size is initially too large, or grows too fast, SGD can become unstable and 
‘explode’, because individual steps are not checked for robustness at the time they are taken. 

Figure 1: Sketch: The task of a classic line search is to tune the step taken by an optimization
algorithm along a univariate search direction (horizontal axis: distance t in line search direction).
The search starts at the endpoint of the previous line search, at t = 0. A sequence of exponentially
growing extrapolation steps finds a point of positive gradient. It is followed by interpolation steps
until an acceptable point is found. Points of insufficient decrease, above the line $f(0) + c_1 t f'(0)$
(gray area), are excluded by the Armijo condition W-I, while points of steep gradient (orange areas)
are excluded by the curvature condition W-II (weak Wolfe conditions in solid orange, strong
extension in lighter tone). The final point is the first to fulfil both conditions, and is thus accepted.


The principally same problem exists in deterministic (noise-free) optimization problems. There, 
providing stability is one of several tasks of the line search subroutine. It is a standard constituent of 
algorithms like the classic nonlinear conjugate gradient [15] and BFGS [16, 17, 18, 19] methods [20,
§3].¹ In the noise-free case, line searches are considered a solved problem [20, §3]. But the methods
used in deterministic optimization are not stable to noise. They are easily fooled by even small
disturbances, either becoming overly conservative or failing altogether. The reason for this brittleness
is that existing line searches take a sequence of hard decisions to shrink or shift the search space.
This yields efficiency, but breaks hard in the presence of noise. Section 3 constructs a probabilistic
line search for noisy objectives, stabilizing optimization methods like the works cited above. As 
line searches only change the length, not the direction of a step, they could be used in combination 
with the algorithms adapting SGD’s direction, cited above. The algorithm presented below is thus a 
complement, not a competitor, to these methods. 

2 Connections 

2.1 Deterministic Line Searches 

There is a host of existing line search variants [20, §3]. In essence, though, these methods explore a
univariate domain ‘to the right’ of a starting point, until an ‘acceptable’ point is reached (Figure 1).
More precisely, consider the problem of minimizing $\mathcal{L}(x) : \mathbb{R}^D \to \mathbb{R}$, with access to
$\nabla\mathcal{L}(x) : \mathbb{R}^D \to \mathbb{R}^D$. At iteration i, some ‘outer loop’ chooses, at location $x_i$, a search
direction $s_i \in \mathbb{R}^D$ (e.g. by the BFGS rule, or simply $s_i = -\nabla\mathcal{L}(x_i)$ for gradient descent). It will
not be assumed that $s_i$ has unit norm. The line search operates along the univariate domain
$x(t) = x_i + t s_i$ for $t \in \mathbb{R}_+$. Along this direction it collects scalar function values and projected
gradients that will be denoted $f(t) = \mathcal{L}(x(t))$ and $f'(t) = s_i^\top \nabla\mathcal{L}(x(t)) \in \mathbb{R}$. Most line searches
involve an initial extrapolation phase to find a point $t_r$ with $f'(t_r) > 0$. This is followed by a search
in $[0, t_r]$, by interval nesting or by interpolation of the collected function and gradient values, e.g.
with cubic splines.²

2.1.1 The Wolfe Conditions for Termination 

As the line search is only an auxiliary step within a larger iteration, it need not find an exact root 
of $f'$; it suffices to find a point ‘sufficiently’ close to a minimum. The Wolfe [21] conditions are a
widely accepted formalization of this notion; they consider t acceptable if it fulfills

$$f(t) \le f(0) + c_1 t f'(0) \quad \text{(W-I)} \qquad \text{and} \qquad f'(t) \ge c_2 f'(0) \quad \text{(W-II)}, \tag{2}$$

using two constants $0 \le c_1 < c_2 \le 1$ chosen by the designer of the line search, not the user. W-I is
the Armijo [22], or sufficient decrease, condition. It encodes that acceptable function values should
lie below a linear extrapolation line of slope $c_1 f'(0)$. W-II is the curvature condition, demanding

¹In these algorithms, another task of the line search is to guarantee certain properties of the surrounding
estimation rule. In BFGS, e.g., it ensures positive definiteness of the estimate. This aspect will not feature here.

²This is the strategy in minimize.m by C. Rasmussen, which provided a model for our implementation. At
the time of writing, it can be found at http://learning.eng.cam.ac.uk/carl/code/minimize/minimize.m

Figure 2: Sketch of a probabilistic line search (horizontal axis: distance t in line search direction).
As in Fig. 1, the algorithm performs extrapolation and interpolation steps, but receives unreliable,
noisy function and gradient values. These are used to construct a GP posterior (top; solid posterior
mean, thin lines at 2 standard deviations, local pdf marginal as shading, three dashed sample paths).
This implies a bivariate Gaussian belief (§3.3) over the validity of the weak Wolfe conditions (middle
three plots; $p_a(t)$ is the marginal for W-I, $p_b(t)$ for W-II, $\rho(t)$ their correlation). Points are
considered acceptable if their joint probability $p^{\mathrm{Wolfe}}(t)$ (bottom) is above a threshold (gray). An
approximation (§3.3.1) to the strong Wolfe conditions is shown dashed.


a decrease in slope. The choice $c_1 = 0$ accepts any value below $f(0)$, while $c_1 = 1$ rejects all
points for convex functions. For the curvature condition, $c_2 = 0$ only accepts points with $f'(t) \ge 0$;
while $c_2 = 1$ accepts any point of greater slope than $f'(0)$. W-I and W-II are known as the weak
form of the Wolfe conditions. The strong form replaces W-II with $|f'(t)| \le c_2 |f'(0)|$ (W-IIa). This
guards against accepting points of low function value but large positive gradient. Figure 1 shows a
conceptual sketch illustrating the typical process of a line search, and the weak and strong Wolfe
conditions. The exposition in §3.3 will initially focus on the weak conditions, which can be precisely
modeled probabilistically. Section 3.3.1 then adds an approximate treatment of the strong form.
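
As a concrete illustration of W-I, W-II and W-IIa in the noise-free case, the following Python sketch checks the three conditions for given function values and slopes. The function name `wolfe_conditions` and the toy objective are illustrative assumptions, not part of the original implementation; the default constants are the values $c_1 = 0.05$, $c_2 = 0.8$ fixed in §3.4.

```python
def wolfe_conditions(f0, df0, ft, dft, t, c1=0.05, c2=0.8):
    """Return the truth values of (W-I, W-II, W-IIa) for noise-free inputs.

    f0, df0: function value and slope at t = 0
    ft, dft: function value and slope at the candidate step t
    """
    armijo = ft <= f0 + c1 * t * df0        # W-I: sufficient decrease
    curvature = dft >= c2 * df0             # W-II: weak curvature condition
    strong = abs(dft) <= c2 * abs(df0)      # W-IIa: strong curvature condition
    return armijo, curvature, strong


# Hypothetical toy objective along the search direction: f(t) = (t - 1)^2.
def f(t):
    return (t - 1.0) ** 2


def df(t):
    return 2.0 * (t - 1.0)


for t in (0.05, 1.0, 2.0):
    print(t, wolfe_conditions(f(0.0), df(0.0), f(t), df(t), t))
```

In this example the very short step t = 0.05 passes W-I but fails the curvature conditions, the minimizer t = 1 passes all three, and the overshoot t = 2 fails W-I and W-IIa while still passing the weak W-II.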


2.2 Bayesian Optimization 

A recently blossoming sample-efficient approach to global optimization revolves around modeling 
the objective $f$ with a probability measure $p(f)$; usually a Gaussian process (GP). Searching for
extrema, evaluation points are then chosen by a utility functional $u[p(f)]$. Our line search borrows
the idea of a Gaussian process surrogate, and a popular utility, expected improvement [23]. Bayesian
optimization methods are often computationally expensive, thus ill-suited for a cost-sensitive task 
like a line search. But since line searches are governors more than information extractors, the kind of 
sample-efficiency expected of a Bayesian optimizer is not needed. The following sections develop a 
lightweight algorithm which adds only minor computational overhead to stochastic optimization. 


3 A Probabilistic Line Search


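We now consider minimizing $y(t) = \hat{\mathcal{L}}(x(t))$ from Eq. (1). That is, the algorithm can access only
noisy function values and gradients $y_t, y'_t$ at location t, with Gaussian likelihood

$$p(y_t, y'_t \mid f) = \mathcal{N}\left( \begin{bmatrix} y_t \\ y'_t \end{bmatrix}; \begin{bmatrix} f(t) \\ f'(t) \end{bmatrix}, \begin{bmatrix} \sigma_f^2 & 0 \\ 0 & \sigma_{f'}^2 \end{bmatrix} \right). \tag{3}$$

The Gaussian form is supported by the Central Limit argument at Eq. (1); see §3.4 regarding estimation
of the variances $\sigma_f^2, \sigma_{f'}^2$. Our algorithm has three main ingredients: A robust yet lightweight Gaussian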
process surrogate on f{t) facilitating analytic optimization; a simple Bayesian optimization objective 
for exploration; and a probabilistic formulation of the Wolfe conditions as a termination criterion. 


3.1 Lightweight Gaussian Process Surrogate 

We model information about the objective in a probability measure p{f). There are two requirements 
on such a measure: First, it must be robust to irregularity of the objective. And second, it must allow 
analytic computation of discrete candidate points for evaluation, because a line search should not call 
yet another optimization subroutine itself. Both requirements are fulfilled by a once-integrated Wiener
process, i.e. a zero-mean Gaussian process prior $p(f) = \mathcal{GP}(f; 0, k)$ with covariance function

$$k(t, t') = \theta^2 \left[ \tfrac{1}{3} \min{}^3(\tilde t, \tilde t') + \tfrac{1}{2} |t - t'| \min{}^2(\tilde t, \tilde t') \right]. \tag{4}$$

Here $\tilde t := t + \tau$ and $\tilde t' := t' + \tau$ denote a shift by a constant $\tau > 0$. This ensures this kernel is positive
semi-definite; the precise value of $\tau$ is irrelevant, as the algorithm only considers positive values of t
(our implementation uses $\tau = 10$). See §3.4 regarding the scale $\theta^2$. With the likelihood of Eq. (3),
this prior gives rise to a GP posterior whose mean function is a cubic spline [25]. We note in passing
that regression on $f$ and $f'$ from N observations of pairs $(y_t, y'_t)$ can be formulated as a filter [26]
and thus performed in $O(N)$ time. However, since a line search typically collects < 10 data points,
generic GP inference, using a Gram matrix, has virtually the same, low cost.

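Because Gaussian measures are closed under linear maps [27, §10], Eq. (4) implies a Wiener process
(linear spline) model on $f'$:

$$p(f; f') = \mathcal{GP}\left( \begin{bmatrix} f \\ f' \end{bmatrix}; \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} k & k^{\partial} \\ {}^{\partial}k & {}^{\partial}k^{\partial} \end{bmatrix} \right), \tag{5}$$

with (using the indicator function $\mathbb{I}(x) = 1$ if x, else 0)

$$k^{\partial}_{tt'} := \frac{\partial k(t,t')}{\partial t'}, \qquad {}^{\partial}k_{tt'} := \frac{\partial k(t,t')}{\partial t}, \qquad {}^{\partial}k^{\partial}_{tt'} := \frac{\partial^2 k(t,t')}{\partial t\, \partial t'},$$

thus

$$\begin{aligned}
k^{\partial}_{tt'} &= \theta^2 \left[ \mathbb{I}(t < t')\, \tilde t^2/2 + \mathbb{I}(t \ge t')\, (\tilde t \tilde t' - \tilde t'^2/2) \right], \\
{}^{\partial}k_{tt'} &= \theta^2 \left[ \mathbb{I}(t' < t)\, \tilde t'^2/2 + \mathbb{I}(t' \ge t)\, (\tilde t \tilde t' - \tilde t^2/2) \right], \\
{}^{\partial}k^{\partial}_{tt'} &= \theta^2 \min(\tilde t, \tilde t').
\end{aligned} \tag{6}$$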
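
The kernel of Eq. (4) and its derivatives in Eq. (6) translate directly into code. The sketch below is an illustration only (the function names `k`, `k_d`, `d_k`, `d_k_d` are assumptions made here, and the default $\tau = 10$, $\theta = 1$ follow the values stated in the text); it also compares the analytic derivative against a central finite difference as a sanity check.

```python
def k(t, s, tau=10.0, theta=1.0):
    """Integrated Wiener process kernel, Eq. (4): cov(f(t), f(s))."""
    m = min(t, s) + tau
    return theta ** 2 * (m ** 3 / 3.0 + 0.5 * abs(t - s) * m ** 2)


def k_d(t, s, tau=10.0, theta=1.0):
    """Derivative of k with respect to its second argument: cov(f(t), f'(s))."""
    a, b = t + tau, s + tau
    if t < s:
        return theta ** 2 * a * a / 2.0
    return theta ** 2 * (a * b - b * b / 2.0)


def d_k(t, s, tau=10.0, theta=1.0):
    """Derivative of k with respect to its first argument: cov(f'(t), f(s)).

    Because k is symmetric, this equals k_d with the arguments swapped.
    """
    return k_d(s, t, tau, theta)


def d_k_d(t, s, tau=10.0, theta=1.0):
    """Mixed second derivative: cov(f'(t), f'(s)), a Wiener process kernel."""
    return theta ** 2 * (min(t, s) + tau)


# Sanity check against a central finite difference in the second argument.
h = 1e-5
t, s = 1.0, 2.0
numeric = (k(t, s + h) - k(t, s - h)) / (2.0 * h)
print(k_d(t, s), numeric)
```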


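Given a set of evaluations $(\mathbf{t}, \mathbf{y}, \mathbf{y}')$ (vectors, with elements $t_i, y_{t_i}, y'_{t_i}$) with independent likelihood (3),
the posterior $p(f \mid \mathbf{y}, \mathbf{y}')$ is a GP with posterior mean $\mu$ and covariance function $\tilde k$ as follows:

$$\mu(t) = \underbrace{\begin{bmatrix} k_{t\mathbf{t}} & k^{\partial}_{t\mathbf{t}} \end{bmatrix} \begin{bmatrix} k_{\mathbf{tt}} + \sigma_f^2 I & k^{\partial}_{\mathbf{tt}} \\ {}^{\partial}k_{\mathbf{tt}} & {}^{\partial}k^{\partial}_{\mathbf{tt}} + \sigma_{f'}^2 I \end{bmatrix}^{-1}}_{=:\, g^\top(t)} \begin{bmatrix} \mathbf{y} \\ \mathbf{y}' \end{bmatrix}, \qquad \tilde k(t, t') = k_{tt'} - g^\top(t) \begin{bmatrix} k_{\mathbf{t}t'} \\ {}^{\partial}k_{\mathbf{t}t'} \end{bmatrix}. \tag{7}$$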
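
To make the posterior mean of Eq. (7) concrete, here is a minimal pure-Python sketch of the generic Gram-matrix inference mentioned above. It is an illustration under stated assumptions, not the authors' MATLAB code: $\theta = 1$ and $\tau = 10$ are hard-coded, the helper names (`solve`, `posterior_mean`) are invented here, and a plain Gaussian elimination stands in for a proper linear algebra library.

```python
TAU = 10.0  # kernel shift tau; theta is fixed to 1 in this sketch


def k(t, s):      # cov(f(t), f(s)), Eq. (4)
    m = min(t, s) + TAU
    return m ** 3 / 3.0 + 0.5 * abs(t - s) * m ** 2


def kd(t, s):     # cov(f(t), f'(s)), Eq. (6)
    a, b = t + TAU, s + TAU
    return a * a / 2.0 if t < s else a * b - b * b / 2.0


def dk(t, s):     # cov(f'(t), f(s)), by symmetry of k
    return kd(s, t)


def dkd(t, s):    # cov(f'(t), f'(s))
    return min(t, s) + TAU


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            fac = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= fac * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def posterior_mean(ts, ys, dys, sigma_f, sigma_df):
    """Return callables mu(t), mu'(t) for the GP posterior mean of Eq. (7)."""
    n = len(ts)
    G = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            G[i][j] = k(ts[i], ts[j]) + (sigma_f ** 2 if i == j else 0.0)
            G[i][n + j] = kd(ts[i], ts[j])
            G[n + i][j] = dk(ts[i], ts[j])
            G[n + i][n + j] = dkd(ts[i], ts[j]) + (sigma_df ** 2 if i == j else 0.0)
    w = solve(G, list(ys) + list(dys))   # Gram matrix inverse applied to [y; y']

    def mu(t):
        return sum(k(t, ts[j]) * w[j] + kd(t, ts[j]) * w[n + j] for j in range(n))

    def dmu(t):
        return sum(dk(t, ts[j]) * w[j] + dkd(t, ts[j]) * w[n + j] for j in range(n))

    return mu, dmu


# Made-up observations; with zero noise the mean interpolates values and slopes.
mu, dmu = posterior_mean([0.0, 1.0, 2.5], [0.0, -0.6, 0.3], [-1.0, -0.2, 0.9], 0.0, 0.0)
print(mu(1.0), dmu(1.0))
```

With non-zero `sigma_f` and `sigma_df` the same code returns the smoothing spline instead of the interpolant; for the fewer than ten observations of a typical line search the cubic cost of the solve is negligible.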


The posterior marginal variance will be denoted by $\mathbb{V}(t) = \tilde k(t, t)$. To see that $\mu$ is indeed piecewise
cubic (i.e. a cubic spline), we note that it has at most three non-vanishing derivatives,⁴ because

$$\begin{aligned}
{}^{\partial^2}k(t, t') &= \theta^2\, \mathbb{I}(t \le t')\, (t' - t), & {}^{\partial^2}k^{\partial}(t, t') &= \theta^2\, \mathbb{I}(t \le t'), \\
{}^{\partial^3}k(t, t') &= -\theta^2\, \mathbb{I}(t \le t'), & {}^{\partial^3}k^{\partial}(t, t') &= 0.
\end{aligned} \tag{8}$$

This piecewise cubic form of $\mu$ is crucial for our purposes: having collected N values of $f$ and
$f'$, respectively, all local minima of $\mu$ can be found analytically in $O(N)$ time in a single sweep
through the ‘cells’ $t_{i-1} < t < t_i$, $i = 1, \dots, N$ (here $t_0 = 0$ denotes the start location, where $(y_0, y'_0)$
are ‘inherited’ from the preceding line search; for typical line searches $N < 10$, cf. §4). In each
cell, $\mu(t)$ is a cubic polynomial with at most one minimum in the cell, found by a trivial quadratic
computation from the three scalars $\mu'(t_i), \mu''(t_i), \mu'''(t_i)$. This is in contrast to other GP regression
models (for example the one arising from a Gaussian kernel), which give more involved posterior
means whose local minima can be found only approximately. Another advantage of the cubic spline
interpolant is that it does not assume the existence of higher derivatives (in contrast to the Gaussian
kernel, for example), and thus reacts robustly to irregularities in the objective.

In our algorithm, after each evaluation of $(y_N, y'_N)$, we use this property to compute a short list
of candidates for the next evaluation, consisting of the $\le N$ local minimizers of $\mu(t)$ and one
additional extrapolation node at $t_{\max} + \alpha$, where $t_{\max}$ is the currently largest evaluated t, and $\alpha$ is
an extrapolation step size starting at $\alpha = 1$ and doubled after each extrapolation step.
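
The per-cell ‘trivial quadratic computation’ and the candidate list can be sketched as follows. This is a hedged illustration: the function names and the convention of passing the three derivatives at the left edge of the cell are assumptions made for the example, not taken from the original code.

```python
import math


def cubic_cell_minimum(t_lo, t_hi, d1, d2, d3):
    """Local minimizer of a cubic inside the open cell (t_lo, t_hi), or None.

    d1, d2, d3 are the first, second and third derivative of the cubic at
    t_lo, so that its slope is d1 + d2*x + 0.5*d3*x^2 with x = t - t_lo.
    """
    if abs(d3) < 1e-12:                 # slope is linear in x
        if d2 <= 0.0:
            return None                 # no interior minimum
        roots = [-d1 / d2]
    else:
        disc = d2 * d2 - 2.0 * d3 * d1  # discriminant of the slope quadratic
        if disc < 0.0:
            return None
        r = math.sqrt(disc)
        roots = [(-d2 + r) / d3, (-d2 - r) / d3]
    for x in roots:
        curvature = d2 + d3 * x         # second derivative at the root
        if 0.0 < x < t_hi - t_lo and curvature > 0.0:
            return t_lo + x
    return None


def candidate_list(cell_minima, t_max, alpha):
    """Local minimizers found in the cells plus one extrapolation node."""
    return [m for m in cell_minima if m is not None] + [t_max + alpha]


# The cubic t^3 - 3t has derivatives (-3, 0, 6) at t = 0 and a minimum at t = 1.
print(cubic_cell_minimum(0.0, 2.0, -3.0, 0.0, 6.0))
print(candidate_list([1.0, None], t_max=2.0, alpha=1.0))
```

Since the curvature at the two roots is $\pm\sqrt{\text{disc}}$, at most one of them can be a minimum, which matches the statement that each cell contains at most one candidate.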


3.2 Choosing Among Candidates 

The previous section described the construction of $\le N + 1$ discrete candidate points for the next
evaluation. To decide at which of the candidate points to actually call $f$ and $f'$, we make use of
a popular utility from Bayesian optimization. Expected improvement [23] is the expected amount,

³Eq. (4) can be generalized to the ‘natural spline’, removing the need for the constant $\tau$ [24, §6.3.1]. However,
this notion is ill-defined in the case of a single observation, which is crucial for the line search.

⁴There is no well-defined probabilistic belief over $f''$ and higher derivatives: sample paths of the Wiener
process are almost surely non-differentiable almost everywhere [28, §2.2]. But $\mu(t)$ is always a member of the
reproducing kernel Hilbert space induced by k, thus piecewise cubic [24, §6.1].

Figure 3: Curated snapshots of line searches (from the MNIST experiment, §4), showing variability of
the objective’s shape and the decision process. Top row: GP posterior and evaluations; bottom row:
approximate $p^{\mathrm{Wolfe}}$ over strong Wolfe conditions. Accepted point marked red.


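under the GP surrogate, by which the function $f(t)$ might be smaller than a ‘current best’ value $\eta$ (we
set $\eta = \min_{i=0,\dots,N} \{\mu(t_i)\}$, where $t_i$ are observed locations):

$$u_{\mathrm{EI}}(t) = \mathbb{E}_{p(f_t \mid \mathbf{y}, \mathbf{y}')}\left[ \max\{0, \eta - f(t)\} \right] = \frac{\eta - \mu(t)}{2} \left( 1 + \operatorname{erf} \frac{\eta - \mu(t)}{\sqrt{2 \mathbb{V}(t)}} \right) + \sqrt{\frac{\mathbb{V}(t)}{2\pi}} \exp\left( -\frac{(\eta - \mu(t))^2}{2 \mathbb{V}(t)} \right). \tag{9}$$

The next evaluation point is chosen as the candidate maximizing this utility, multiplied by the
probability for the Wolfe conditions to be fulfilled, which is derived in the following section.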
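
The closed form of expected improvement for a Gaussian marginal with mean $\mu(t)$ and variance $\mathbb{V}(t)$ needs only the error function. The sketch below is illustrative (the name `expected_improvement` and the handling of zero variance are assumptions made here); it evaluates the standard expression $\mathbb{E}[\max\{0, \eta - f\}]$ for $f \sim \mathcal{N}(\mu, \mathbb{V})$.

```python
import math


def expected_improvement(mu, var, eta):
    """E[max(0, eta - f)] for f ~ N(mu, var)."""
    d = eta - mu
    if var <= 0.0:                      # degenerate belief: no uncertainty left
        return max(d, 0.0)
    return (0.5 * d * (1.0 + math.erf(d / math.sqrt(2.0 * var)))
            + math.sqrt(var / (2.0 * math.pi)) * math.exp(-d * d / (2.0 * var)))


# A candidate whose mean equals the current best still has positive utility,
# because half of the probability mass lies below eta.
print(expected_improvement(mu=0.0, var=1.0, eta=0.0))
```

The utility grows with both a lower posterior mean and a larger posterior variance, so it trades off exploitation of the spline minima against exploration at the extrapolation node.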


3.3 Probabilistic Wolfe Conditions for Termination 


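The key observation for a probabilistic extension of W-I and W-II is that they are positivity constraints
on two variables $a_t, b_t$ that are both linear projections of the (jointly Gaussian) variables $f$ and $f'$:

$$\begin{bmatrix} a_t \\ b_t \end{bmatrix} = \begin{bmatrix} 1 & c_1 t & -1 & 0 \\ 0 & -c_2 & 0 & 1 \end{bmatrix} \begin{bmatrix} f(0) \\ f'(0) \\ f(t) \\ f'(t) \end{bmatrix} \ge 0. \tag{10}$$

The GP of Eq. (5) on $f$ thus implies, at each value of t, a bivariate Gaussian distribution

$$p(a_t, b_t) = \mathcal{N}\left( \begin{bmatrix} a_t \\ b_t \end{bmatrix}; \begin{bmatrix} m^a_t \\ m^b_t \end{bmatrix}, \begin{bmatrix} C^{aa}_t & C^{ab}_t \\ C^{ba}_t & C^{bb}_t \end{bmatrix} \right), \tag{11}$$

with

$$m^a_t = \mu(0) - \mu(t) + c_1 t\, \mu'(0) \qquad \text{and} \qquad m^b_t = \mu'(t) - c_2\, \mu'(0) \tag{12}$$

and

$$\begin{aligned}
C^{aa}_t &= \tilde k_{00} + (c_1 t)^2\, {}^{\partial}\tilde k^{\partial}_{00} + \tilde k_{tt} + 2 \left[ c_1 t\, (\tilde k^{\partial}_{00} - {}^{\partial}\tilde k_{0t}) - \tilde k_{0t} \right], \\
C^{bb}_t &= c_2^2\, {}^{\partial}\tilde k^{\partial}_{00} - 2 c_2\, {}^{\partial}\tilde k^{\partial}_{0t} + {}^{\partial}\tilde k^{\partial}_{tt}, \\
C^{ab}_t = C^{ba}_t &= -c_2 \left( \tilde k^{\partial}_{00} + c_1 t\, {}^{\partial}\tilde k^{\partial}_{00} \right) + c_2\, {}^{\partial}\tilde k_{0t} + \tilde k^{\partial}_{0t} + c_1 t\, {}^{\partial}\tilde k^{\partial}_{0t} - \tilde k^{\partial}_{tt}.
\end{aligned} \tag{13}$$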


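The quadrant probability $p^{\mathrm{Wolfe}}_t = p(a_t > 0 \wedge b_t > 0)$ for the Wolfe conditions to hold is an integral
over a bivariate normal probability,

$$p^{\mathrm{Wolfe}}_t = \int_{-m^a_t / \sqrt{C^{aa}_t}}^{\infty} \int_{-m^b_t / \sqrt{C^{bb}_t}}^{\infty} \mathcal{N}\left( \begin{bmatrix} a \\ b \end{bmatrix}; \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho_t \\ \rho_t & 1 \end{bmatrix} \right) da\, db, \tag{14}$$

with correlation coefficient $\rho_t = C^{ab}_t / \sqrt{C^{aa}_t C^{bb}_t}$. It can be computed efficiently [29], using readily
available code⁵ (on a laptop, one evaluation of $p^{\mathrm{Wolfe}}_t$ costs about 100 microseconds, and each line search
requires < 50 such calls). The line search computes this probability for all evaluation nodes, after
each evaluation. If any of the nodes fulfills the Wolfe conditions with $p^{\mathrm{Wolfe}}_t > c_W$, i.e. with probability
greater than some threshold $0 < c_W \le 1$, it is accepted and returned. If several nodes simultaneously fulfill this
requirement, the t of the lowest $\mu(t)$ is returned. Section 3.4 below motivates fixing $c_W = 0.3$.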
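
For readers without access to a dedicated bivariate normal routine, the integral of Eq. (14) can be approximated with a one-dimensional quadrature. The sketch below is not the method of [29]; it swaps in a composite Simpson rule over a, using the fact that b given a is again Gaussian. The function names, the truncation of the a-range at ±8 standard deviations, and the optional `b_upper` argument (a finite upper limit on b, as used by the approximation of §3.3.1) are assumptions made for illustration.

```python
import math


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def wolfe_probability(m_a, m_b, C_aa, C_bb, C_ab, b_upper=math.inf, n=2000):
    """Approximate P(a > 0, 0 < b < b_upper) for jointly Gaussian (a, b)."""
    sa, sb = math.sqrt(C_aa), math.sqrt(C_bb)
    rho = max(-0.999999, min(0.999999, C_ab / (sa * sb)))
    lo_a = -m_a / sa                     # standardized lower limit for a
    lo_b = -m_b / sb                     # standardized lower limit for b
    hi_b = (b_upper - m_b) / sb          # standardized upper limit for b
    s = math.sqrt(1.0 - rho * rho)       # std. dev. of b given a

    def integrand(a):
        # N(a; 0, 1) times P(lo_b < b < hi_b | a), where b | a ~ N(rho*a, s^2)
        upper = 1.0 if math.isinf(hi_b) else std_normal_cdf((hi_b - rho * a) / s)
        lower = std_normal_cdf((lo_b - rho * a) / s)
        return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi) * (upper - lower)

    start, end = max(lo_a, -8.0), 8.0    # truncate the negligible tails
    if start >= end:
        return 0.0
    h = (end - start) / n                # n must be even for Simpson's rule
    total = integrand(start) + integrand(end)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * integrand(start + i * h)
    return total * h / 3.0


# Zero means, unit variances, correlation 0.5: the orthant probability is
# 1/4 + arcsin(0.5) / (2 pi) = 1/3.
print(wolfe_probability(0.0, 0.0, 1.0, 1.0, 0.5))
```

A node would then be accepted when the returned value exceeds the threshold $c_W$. A production implementation would more likely call an existing bivariate normal routine, which is considerably faster than this quadrature.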


⁵e.g. http://www.math.wsu.edu/faculty/genz/software/matlab/bvn.m


3.3.1 Approximation for strong conditions: 


As noted in Section 2.1.1, deterministic optimizers tend to use the strong Wolfe conditions, which
use $|f'(0)|$ and $|f'(t)|$. A precise extension of these conditions to the probabilistic setting is numerically
taxing, because the distribution over $|f'|$ is a non-central $\chi$-distribution, requiring customized
computations. However, a straightforward variation to Eq. (14) captures the spirit of the strong Wolfe
conditions, that large positive derivatives should not be accepted: Assuming $f'(0) < 0$ (i.e. that the
search direction is a descent direction), the strong second Wolfe condition can be written exactly as

$$0 \le b_t = f'(t) - c_2 f'(0) \le -2 c_2 f'(0). \tag{15}$$

The value $-2 c_2 f'(0)$ is bounded to 95% confidence by

$$-2 c_2 f'(0) \lesssim 2 c_2 \left( |\mu'(0)| + 2 \sqrt{\mathbb{V}'(0)} \right) =: \bar b, \tag{16}$$

where $\mathbb{V}'(0) = {}^{\partial}\tilde k^{\partial}_{00}$ denotes the posterior marginal variance of $f'(0)$.
Hence, an approximation to the strong Wolfe conditions can be reached by replacing the infinite
upper integration limit on b in Eq. (14) with $(\bar b - m^b_t)/\sqrt{C^{bb}_t}$. The effect of this adaptation, which
adds no overhead to the computation, is shown in Figure 2 as a dashed line.
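
The step from the strong condition W-IIa to the two-sided constraint on $b_t$ can be spelled out as follows. This is a short derivation that uses only the definition of $b_t$ and the descent assumption $f'(0) < 0$ stated in the text.

```latex
\begin{align*}
|f'(t)| \le c_2 |f'(0)|
  &\iff c_2 f'(0) \le f'(t) \le -c_2 f'(0)
     && \text{since } |f'(0)| = -f'(0) \\
  &\iff 0 \le f'(t) - c_2 f'(0) \le -2 c_2 f'(0)
     && \text{subtracting } c_2 f'(0) \\
  &\iff 0 \le b_t \le -2 c_2 f'(0).
\end{align*}
```

The lower bound is exactly the weak condition W-II; only the upper bound is new, which is why the approximation changes a single integration limit.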


3.4 Eliminating Hyper-parameters 

As a black-box inner loop, the line search should not require any tuning by the user. The preceding 
section introduced six so-far undefined parameters: $c_1, c_2, c_W, \theta, \sigma_f, \sigma_{f'}$. We will now show that
$c_1, c_2, c_W$ can be fixed by hard design decisions; $\theta$ can be eliminated by standardizing the optimization
objective within the line search; and the noise levels can be estimated at runtime with low
overhead for batch objectives of the form in Eq. (1). The result is a parameter-free algorithm that
effectively removes the one most problematic parameter from SGD: the learning rate.


Design Parameters $c_1, c_2, c_W$: Our algorithm inherits the Wolfe thresholds $c_1$ and $c_2$ from its
deterministic ancestors. We set $c_1 = 0.05$ and $c_2 = 0.8$. This is a standard setting that yields a
‘lenient’ line search, i.e. one that accepts most descent points. The rationale is that the stochastic
aspect of SGD is not always problematic, but can also be helpful through a kind of ‘annealing’ effect.

The acceptance threshold $c_W$ is a new design parameter arising only in the probabilistic setting. We
fix it to $c_W = 0.3$. To motivate this value, first note that in the noise-free limit, all values $0 < c_W < 1$
are equivalent, because $p^{\mathrm{Wolfe}}$ then switches discretely between 0 and 1 upon observation of the
function. A back-of-the-envelope computation (left out for space), assuming only two evaluations
at $t = 0$ and $t = t_1$ and the same fixed noise level on $f$ and $f'$ (which then cancels out), shows
that function values barely fulfilling the conditions, i.e. $a_{t_1} = b_{t_1} = 0$, can have $p^{\mathrm{Wolfe}}_{t_1} \sim 0.2$, while
function values at $a_{t_1} = b_{t_1} = -\epsilon$ for $\epsilon \to 0$ with ‘unlucky’ evaluations (both function and gradient
values one standard deviation from the true value) can achieve $p^{\mathrm{Wolfe}}_{t_1} \sim 0.4$. The choice $c_W = 0.3$
balances the two competing desiderata for precision and recall. Empirically (Fig. 3), we rarely
observed values of $p^{\mathrm{Wolfe}}_t$ close to this threshold. Even at high evaluation noise, a function evaluation
typically either clearly rules out the Wolfe conditions, or lifts $p^{\mathrm{Wolfe}}_t$ well above the threshold.

Scale $\theta$: The parameter $\theta$ of Eq. (4) simply scales the prior variance. It can be eliminated by scaling
the optimization objective: We set $\theta = 1$ and scale $y_i \leftarrow (y_i - y_0)/|y'_0|$, $y'_i \leftarrow y'_i/|y'_0|$ within the code of
the line search. This gives $y(0) = 0$ and $y'(0) = -1$, and typically ensures the objective ranges in
the single digits across $0 < t < 10$, where most line searches take place. The division by $|y'_0|$ causes
a non-Gaussian disturbance, but this does not seem to have notable empirical effect.
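
The rescaling is a two-line operation. The following sketch (the function name `standardize` and the list-based interface are assumptions made for illustration) shifts and scales the observations so that the first one has value 0 and slope of magnitude 1.

```python
def standardize(ys, dys):
    """Rescale observations so that y(0) = 0 and |y'(0)| = 1.

    ys, dys: noisy function values and projected gradients, with the
    observation at t = 0 stored first.
    """
    y0, scale = ys[0], abs(dys[0])
    if scale == 0.0:
        raise ValueError("initial projected gradient is zero; cannot rescale")
    ys_scaled = [(y - y0) / scale for y in ys]
    dys_scaled = [dy / scale for dy in dys]
    return ys_scaled, dys_scaled


# Hypothetical values: y(0) = 3, y'(0) = -4 and one further evaluation.
print(standardize([3.0, 2.0], [-4.0, 1.0]))
```

For a descent direction the initial slope is negative, so after rescaling it equals −1, as stated in the text.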

Noise Scales $\sigma_f, \sigma_{f'}$: The likelihood (3) requires standard deviations for the noise on both function
values ($\sigma_f$) and gradients ($\sigma_{f'}$). One could attempt to learn these across several line searches.
However, in exchangeable models, as captured by Eq. (1), the variance of the loss and its gradient
can be estimated directly within the batch, at low computational overhead, an approach already
advocated by Schaul et al. [14]. We collect the empirical statistics

$$\hat S(x) := \frac{1}{m} \sum_{j}^{m} \ell^2(x, y_j), \qquad \text{and} \qquad \hat{\nabla S}(x) := \frac{1}{m} \sum_{j}^{m} \nabla\ell(x, y_j)^{.2} \tag{17}$$

(where $^{.2}$ denotes the element-wise square) and estimate, at the beginning of a line search from $x_k$,

$$\sigma_f^2 = \frac{1}{m-1} \left( \hat S(x_k) - \hat{\mathcal{L}}(x_k)^2 \right) \qquad \text{and} \qquad \sigma_{f'}^2 = (s_i^{.2})^\top \left[ \frac{1}{m-1} \left( \hat{\nabla S}(x_k) - (\nabla\hat{\mathcal{L}})^{.2} \right) \right]. \tag{18}$$

This amounts to the cautious assumption that noise on the gradient is independent. We finally scale
the two empirical estimates as described for the scale $\theta$ above: $\sigma_f \leftarrow \sigma_f/|y'(0)|$, and ditto for $\sigma_{f'}$.
The overhead of this estimation is small if the computation of $\ell(x, y_j)$ itself is more expensive than the
summation over j (in the neural network examples of §4, with their comparably simple $\ell$, the additional
steps added only ∼1% cost overhead to the evaluation of the loss). Of course, this approach requires a
batch size $m > 1$. For single-sample batches, a running average could be used instead (single-sample
batches are not necessarily a good choice: in our experiments, for example, vanilla SGD with batch size 10
converged faster in wall-clock time than unit-batch SGD). Estimating noise separately for each input
dimension captures the often inhomogeneous structure among gradient elements, and its effect on the
noise along the projected direction. For example, in deep models, gradient noise is typically higher
on weights between the input and first hidden layer, hence line searches along the corresponding
directions are noisier than those along directions affecting higher-level weights.
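
The within-batch estimators of Eqs. (17) and (18) can be written out as below. This is a plain-Python sketch for illustration: the function name `batch_noise_variances` and the representation of per-sample gradients as nested lists are assumptions, and a practical implementation would accumulate the sums alongside the loss computation rather than store per-sample gradients.

```python
def batch_noise_variances(losses, grads, s):
    """Estimate (sigma_f^2, sigma_f'^2) from one mini-batch, Eqs. (17)-(18).

    losses: m per-sample losses l(x, y_j)
    grads:  m per-sample gradients, each a list of length D
    s:      search direction, a list of length D
    """
    m = len(losses)
    if m < 2:
        raise ValueError("batch size must exceed 1")
    L_hat = sum(losses) / m                          # mini-batch loss
    S_hat = sum(l * l for l in losses) / m           # mean squared loss
    var_f = (S_hat - L_hat ** 2) / (m - 1)

    var_df = 0.0
    for d in range(len(s)):
        g_mean = sum(g[d] for g in grads) / m        # d-th element of the mean gradient
        g_sq = sum(g[d] ** 2 for g in grads) / m     # d-th element of the mean squared gradient
        # project the element-wise variance with the squared direction
        var_df += s[d] ** 2 * (g_sq - g_mean ** 2) / (m - 1)
    return var_f, var_df


# Two hypothetical samples in D = 2 dimensions.
print(batch_noise_variances([1.0, 3.0], [[1.0, 0.0], [3.0, 0.0]], [2.0, 1.0]))
```

Both returned values are variances on the original scale of the objective; they still have to be divided by $|y'(0)|$ (after taking the square root) to match the standardized observations.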


3.4.1 Propagating Step Sizes Between Line Searches 

As will be demonstrated in §4, the line search can find good step sizes even if the length of the
direction $s_i$ (which is proportional to the learning rate $\alpha$ in SGD) is mis-scaled. Since such scale
issues typically persist over time, it would be wasteful to have the algorithm re-fit a good scale in each
line search. Instead, we propagate step lengths from one iteration of the search to another: We set the
initial search direction to $s_0 = -\alpha_0 \nabla\hat{\mathcal{L}}(x_0)$ with some initial learning rate $\alpha_0$. Then, after each line
search ending at $x_i = x_{i-1} + t_* s_i$, the next search direction is set to $s_{i+1} = -1.3 \cdot t_* \alpha_0 \nabla\hat{\mathcal{L}}(x_i)$.
Thus, the next line search starts its extrapolation at 1.3 times the step size of its predecessor.


Remark on convergence of SGD with line searches: We note in passing that it is straightforward 
to ensure that SGD instances using the line search inherit the convergence guarantees of SGD: 
Putting even an extremely loose bound $\bar\alpha_i$ on the step sizes taken by the i-th line search, such that
$\sum_i^\infty \bar\alpha_i = \infty$ and $\sum_i^\infty \bar\alpha_i^2 < \infty$, ensures the line search-controlled SGD converges in probability [1].


4 Experiments 

Our experiments were performed on the well-worn problems of training a 2-layer neural net with
logistic nonlinearity on the MNIST and CIFAR-10 datasets.⁶ In both cases, the network had 800 hidden
units, giving optimization problems with 636 010 and 2 466 410 parameters, respectively. While
this may be ‘low-dimensional’ by contemporary standards, it exhibits the stereotypical challenges
of stochastic optimization for machine learning. Since the line search deals with only univariate 
subproblems, the extrinsic dimensionality of the optimization task is not particularly relevant for an 
empirical evaluation. Leaving aside the cost of the function evaluations themselves, computation cost 
associated with the line search is independent of the extrinsic dimensionality. 

The central nuisance of SGD is having to choose the learning rate $\alpha$, and potentially also a schedule for
its decrease. Theoretically, a decaying learning rate is necessary to guarantee convergence of SGD [1],
but empirically, keeping the rate constant, or only decaying it cautiously, often works better (Fig. 4). In
a practical setting, a user would perform exploratory experiments (say, for 10^ steps), to determine a 
good learning rate and decay schedule, then run a longer experiment in the best found setting. In our 
networks, constant learning rates of $\alpha = 0.75$ and $\alpha = 0.08$ for MNIST and CIFAR-10, respectively,
achieved the lowest test error after the first 10^ steps of SGD. We then trained networks with vanilla 
SGD with and without $\alpha$-decay (using the schedule $\alpha(i) = \alpha_0/i$), and SGD using the probabilistic
line search, with $\alpha_0$ ranging across five orders of magnitude, on batches of size m = 10.

Fig. 4, top, shows test errors after 10 epochs as a function of the initial learning rate $\alpha_0$ (error bars
based on 20 random re-starts). Across the broad range of $\alpha_0$ values, the line search quickly identified
good step sizes $\alpha(t)$, stabilized the training, and progressed efficiently, reaching test errors similar

⁶http://yann.lecun.com/exdb/mnist/ and http://www.cs.toronto.edu/~kriz/cifar.html. Like other authors,
we only used the “batch 1” sub-set of CIFAR-10.

Figure 4: Top row: test error after 10 epochs as function of initial learning rate (note logarithmic
ordinate for MNIST). Bottom row: Test error as function of training epoch (same color and symbol
scheme as in top row). No matter the initial learning rate, the line search-controlled SGD performs
close to the (in practice unknown) optimal SGD instance, effectively removing the need for exploratory
experiments and learning-rate tuning. All plots show means and 2 std.-deviations over 20 repetitions.


to those reported in the literature for tuned versions of this kind of architecture on these datasets. 
While in both datasets, the best SGD instance without rate-decay just barely outperformed the line 
searches, the optimal $\alpha$ value was not the one that performed best after 10^ steps. So this kind of
exploratory experiment (which comes with its own cost of human designer time) would have led to 
worse performance than simply starting a single instance of SGD with the line search and $\alpha_0 = 1$,
letting the algorithm do the rest. 

Average time overhead (i.e. excluding evaluation-time for the objective) was about 48ms per line 
search. This is independent of the problem dimensionality, and expected to drop significantly with 
optimized code. Analysing one of the MNIST instances more closely, we found that the average 
length of a line search was ∼1.4 function evaluations; 80%–90% of line searches terminated
after the first evaluation. This suggests good scale adaptation and thus efficient search (note that an 
‘optimally tuned’ algorithm would always lead to accepts). 

The supplements provide additional plots, of raw objective values, chosen step-sizes, encountered 
gradient norms and gradient noises during the optimization, as well as test-vs-train error plots, for each 
of the two datasets, respectively. These provide a richer picture of the step-size control performed by 
the line search. In particular, they show that the line search chooses step sizes that follow a nontrivial 
dynamic over time. This is in line with the empirical truism that SGD requires tuning of the step size 
during its progress, a nuisance taken care of by the line search. Using this structured information for 
more elaborate analytical purposes, in particular for convergence estimation, is an enticing prospect, 
but beyond the scope of this paper. 

5 Conclusion 

The line search paradigm widely accepted in deterministic optimization can be extended to noisy 
settings. Our design combines existing principles from the noise-free case with ideas from Bayesian 
optimization, adapted for efficiency. We arrived at a lightweight “black-box” algorithm that exposes 
no parameters to the user. Our method is complementary to, and can in principle be combined with, 
virtually all existing methods for stochastic optimization that adapt a step direction of fixed length. 
Empirical evaluations suggest the line search effectively frees users from worries about the choice of 
a learning rate: Any reasonable initial choice will be quickly adapted and lead to close to optimal 
performance. Our MATLAB implementation can be found at http://tinyurl.com/probLineSearch


References 

[1] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics,
22(3):400-407, Sep. 1951.

[2] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In 
Twenty-first International Conference on Machine Learning (ICML 2004), 2004. 

[3] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th Int.
Conf. on Computational Statistics (COMPSTAT), pages 177-186. Springer, 2010.

[4] M.D. Hoffman, D.M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine 
Learning Research, 14(1):1303-1347, 2013. 

[5] J. Hensman, M. Rattray, and N.D. Lawrence. Fast variational inference in the conjugate exponential family.
In Advances in Neural Information Processing Systems (NIPS 25), pages 2888-2896, 2012.

[6] T. Broderick, N. Boyd, A. Wibisono, A.C. Wilson, and M.I. Jordan. Streaming variational Bayes. In
Advances in Neural Information Processing Systems (NIPS 26), pages 1727-1735, 2013.

[7] A.P. George and W.B. Powell. Adaptive stepsizes for recursive estimation with applications in approximate 
dynamic programming. Machine Learning, 65(1):167-198, 2006. 

[8] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic
optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.

[9] N.N. Schraudolph. Local gain adaptation in stochastic gradient descent. In Ninth International Conference 
on Artificial Neural Networks (ICANN) 99, volume 2, pages 569-574, 1999. 

[10] S.-I. Amari, H. Park, and K. Fukumizu. Adaptive method of realizing natural gradient learning for
multilayer perceptrons. Neural Computation, 12(6):1399-1409, 2000.

[11] N.L. Roux and A.W. Fitzgibbon. A fast natural Newton method. In 27th International Conference on 
Machine Learning (ICML), pages 623-630, 2010. 

[12] R. Rajesh, W. Chong, D. Blei, and E. Xing. An adaptive learning rate for stochastic variational inference. 
In 30th International Conference on Machine Learning (ICML), pages 298-306, 2013. 

[13] P. Hennig. Fast Probabilistic Optimization from Noisy Gradients. In 30th International Conference on 
Machine Learning (ICML), 2013. 

[14] T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. In 30th International Conference on 
Machine Learning (ICML-13), pages 343-351, 2013. 

[15] R. Fletcher and C.M. Reeves. Function minimization by conjugate gradients. The Computer Journal, 
7(2):149-154, 1964. 

[16] C.G. Broyden. A new double-rank minimization algorithm. Notices of the AMS, 16:670, 1969. 

[17] R. Fletcher. A new approach to variable metric algorithms. The Computer Journal, 13(3):317, 1970. 

[18] D. Goldfarb. A family of variable metric updates derived by variational means. Math. Comp., 
24(109):23-26, 1970. 

[19] D.F. Shanno. Conditioning of quasi-Newton methods for function minimization. Math. Comp., 
24(111):647-656, 1970. 

[20] J. Nocedal and S.J. Wright. Numerical Optimization. Springer Verlag, 1999. 

[21] P. Wolfe. Convergence conditions for ascent methods. SIAM Review, pages 226-235, 1969. 

[22] L. Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal 
of Mathematics, 16(1):1-3, 1966. 

[23] D.R. Jones, M. Schonlau, and W.J. Welch. Efficient global optimization of expensive black-box functions. 
Journal of Global Optimization, 13(4):455-492, 1998. 

[24] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. MIT, 2006. 

[25] G. Wahba. Spline models for observational data. Number 59 in CBMS-NSF Regional Conferences series 
in applied mathematics. SIAM, 1990. 

[26] S. Sarkka. Bayesian filtering and smoothing. Cambridge University Press, 2013. 

[27] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, 3rd 
edition, 1991. 

[28] R.J. Adler. The Geometry of Random Fields. Wiley, 1981. 

[29] Z. Drezner and G.O. Wesolowsky. On the computation of the bivariate normal integral. Journal of 
Statistical Computation and Simulation, 35(1-2):101-107, 1990. 





Supplements 

This supplementary document contains additional results of the experiments described in the main 
paper, using the probabilistic line search algorithm to control the learning rate in stochastic gradient 
descent during training of two-layer neural network architectures on the CIFAR-10 and MNIST 
datasets. 

1 Evolution of Function Values 

Figure 1 plots the evolution of encountered raw function values against function evaluations. Each 
function call evaluates the gradient on a batch of size 10, both for SGD with constant and decaying 
learning rate, and for the line search-enhanced SGD. To keep the plot readable, the plot lines have 
been smoothed with a windowed running average, and are plotted only at logarithmically spaced points. 
Among the noteworthy features of these plots is that SGD with large step sizes can be unstable 
(divergent dashed black lines), while this instability is caught and controlled by the line search. 
Regardless of initial step size, all line search-controlled instances perform very similarly, and reach 
close to optimal performance. Over the dynamic development of the optimization process, some 
specific choices of step size temporarily perform better than the line search-controlled instances, but 
this advantage shrinks or vanishes over time, because no fixed step size is optimal over the course of 
the entire optimization process. As mentioned in the main paper, finding those optimal step sizes 
would normally involve a tedious, costly search, which the line search effectively removes. 
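
The smoothing and thinning used for these plots can be sketched as follows. This is a minimal illustration, not the code used to produce the figures: the function name smooth_and_thin, the window length, the number of plotted points and the synthetic curve are all assumptions made for the example.

```python
import numpy as np

def smooth_and_thin(values, window=50, num_points=30):
    """Running-average smoothing followed by log-spaced subsampling.

    `window` and `num_points` are illustrative defaults, not values from the paper.
    """
    values = np.asarray(values, dtype=float)
    # Windowed running average; mode="valid" avoids zero-padding artifacts at the edges.
    kernel = np.ones(window) / window
    smoothed = np.convolve(values, kernel, mode="valid")
    # Logarithmically spaced indices into the smoothed curve (duplicates removed).
    idx = np.unique(np.geomspace(1, len(smoothed), num_points).astype(int)) - 1
    return idx, smoothed[idx]

# Synthetic noisy, decaying "function value" trace standing in for real training data.
rng = np.random.default_rng(0)
raw = 1.0 / np.sqrt(np.arange(1, 10001)) + 0.1 * rng.standard_normal(10000)
idx, curve = smooth_and_thin(raw)
```

Each returned index refers to the smoothed curve, so entry i averages raw evaluations i through i + window - 1. Log-spaced thinning keeps early iterations densely sampled, which is where the curves change fastest on a logarithmic abscissa.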

2 Optimal Step Sizes Vary During Training 

It is a “widely known empirical fact” that SGD instances require a certain amount of run-time 
“tweaking”, because the optimal step size depends not just on the local structure of the objective, but 
also on batch size (and associated noise level). Figure 2 shows accepted step sizes, initial gradients at 
each search, and estimated gradient noise levels for the line search instances in the same experimental 
runs described above (smoothed and thinned as in Fig. 1 above). Starting five orders of magnitude 
apart, the line searches very quickly converged to similar step sizes, and indeed eventually settled 
around the empirically optimal value of α = 0.075 (MNIST) and α = 0.08 (CIFAR-10) (dashed 
green horizontal line in Fig. 2). But step sizes varied over time: starting out small, they then increased, 
and began decreasing again after around 10^ line searches. This corroborates the empirical truism 
that learning rates should not immediately start decreasing, and only do so slowly. Interestingly, 
while there is an association between gradient values, noise and accepted step sizes, there appears to 
be no simple analytic relationship between the three. Overall, the emerging picture is that there is 
indeed nontrivial structure in the objective that is picked up by the line search. 
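
To make the notion of a gradient noise level concrete, the sketch below estimates the standard deviation of a mini-batch gradient from the spread of the per-sample gradients within the batch. It is a hedged illustration that assumes i.i.d. batch elements and uses the standard sample-variance-of-the-mean formula; the function name minibatch_gradient_and_noise and the least-squares toy problem are hypothetical, and the estimator is not claimed to be identical to the one used in the experiments.

```python
import numpy as np

def minibatch_gradient_and_noise(per_sample_grads):
    """Mean gradient and elementwise std. deviation of that mean over a batch.

    per_sample_grads: array of shape (m, D), one gradient per data point.
    """
    g = np.asarray(per_sample_grads, dtype=float)
    m = g.shape[0]
    mean_grad = g.mean(axis=0)
    # Unbiased sample variance of the individual gradients, divided by m to
    # obtain the variance of the batch mean (assumes i.i.d. batch elements).
    var_of_mean = g.var(axis=0, ddof=1) / m
    return mean_grad, np.sqrt(var_of_mean)

# Hypothetical toy problem: least-squares terms 0.5 * (a_j @ x - b_j)**2, batch size 10.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)
x = np.zeros(3)
residuals = A @ x - b
per_sample = residuals[:, None] * A  # gradient of each term with respect to x
grad, sigma = minibatch_gradient_and_noise(per_sample)
```

Because the variance of the mean scales as 1/m, the estimated noise level shrinks with the batch size, which is one reason the optimal step size depends on the batch size.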

3 Line Searches Do Not Affect Overfitting 

A final worry one might have is that the control interventions of the line search might curtail an 
“accidental” beneficial property of SGD, for example that the somewhat erratic steps 
caused by stochasticity in the gradients allow SGD to “jump over” local minima of the objective. Such 
local minima can be a cause of over-fitting, or generally of low empirical performance. The plots 
in the main paper already confirm that the line search does not cause a stagnation in optimization 
performance, and can indeed improve this performance drastically. For completeness, Figure 3 
also shows the relation between encountered train- and test-set error rates over the course of the 
optimization, and Figure 4 shows the evolution of the train-set error per epoch as well as its dependence 
on the initial learning rate (same symbols and colors as in Figure 4 of main paper). There is generally 
little over-fitting in MNIST, and fairly strong over-fitting in CIFAR-10. But the instances controlled 
by the line search (circles) do not show a noticeably different behaviour, in this regard, to the 
uncontrolled, diffusive SGD instances. This suggests that, when the line search intervenes to curtail 
steps of SGD, it does so typically to prevent a truly sub-optimal step, rather than a beneficial “hop” 
over the walls surrounding a local minimum. 

[Figure 1 panels: CIFAR-10 2-layer neural net; MNIST 2-layer neural net]

Figure 1: Function values encountered during neural network experiments. Vanilla SGD at different 
fixed learning rates α as dashed lines, SGD at different decaying learning rates as dashed-dotted lines. 
Solid lines show results using the probabilistic line search initialized at the corresponding α-values. 



Figure 2: Same colors as Fig. 4 in the main paper: Accepted step sizes α, initial gradients |f'_0| and 
absolute noise on gradient σ_f'. The probabilistic line search quickly fixes even wildly varying initial 
learning rates, within the first few line searches. The accepted step lengths drift over the course of the 
optimization, in a nontrivial relation to noise levels and gradient norms. 

[Figure 3 panels: CIFAR-10 2-layer neural net; MNIST 2-layer neural net]

Figure 3: Test set error rates plotted against training set error rates. Same symbols and colors as 
in Figure 4 of the main paper. While there is significant over-fitting in CIFAR-10, and virtually no 
over-fitting in the MNIST case, the line search-controlled instances of SGD perform similarly to the 
best SGD instances. 


[Figure 4 panels: CIFAR-10 2-layer neural net; MNIST 2-layer neural net. Horizontal axes: initial learning rate (top row), epoch (bottom row)]

Figure 4: Top row: train error after 10 epochs as function of initial learning rate (note logarithmic 
ordinate for MNIST). Bottom row: Train error as function of training epoch (same color and symbol 
scheme as in top row). Same symbols and colors as in Figure 4 of the main paper. No matter the 
initial learning rate, the line search-controlled SGD performs close to the (in practice unknown) optimal 
SGD instance, effectively removing the need for exploratory experiments and learning-rate tuning. 
All plots show means and 2 standard deviations over 20 repetitions. 